1 Exploring data

The very first step of data modeling and machine learning is to understand your data. This critical step determines which methods will be used in the rest of the data analytics process. Whether a clear output variable can be identified, and what the data type and scale of that variable are, are key questions to address at the exploratory data analysis stage. Visualizing the data comes first and foremost in the entire data analytics process.

John Tukey (1977) remarks that the over-emphasis on statistical significance, or the hypothesis confirmation process, neglects the other important part of data analysis, which Garrett Grolemund and Hadley Wickham term hypothesis generation. Tukey suggests that the purpose of Exploratory Data Analysis (EDA) is to suggest hypotheses. Confusing the two types of analyses and employing them on the same set of data can lead to systematic bias, owing to the issues inherent in testing hypotheses suggested by the data.

John W. Tukey

“An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question” - John W. Tukey

According to John T. Behrens, researchers use EDA to:

  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments

Grolemund and Wickham describe the EDA process as an iterative cycle:

  • Generate questions about data
  • Search for answers by visualizing, transforming, and modeling the data
  • Refine the research questions and/or generate new questions

At the end of the process, EDA leads to a decision about which methods to adopt in the next stage.
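One pass through the cycle above can be sketched in a few lines of base R. This is only an illustration: the built-in `cars` data, the square-root transformation, and the linear model are assumptions chosen for brevity, not part of the original material.

```r
data(cars)  # built-in data: speed (mph) and stopping distance (ft)

# 1. Generate a question: how does stopping distance grow with speed?

# 2. Search for answers by visualizing, transforming, and modeling
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
cars$sqrt_dist <- sqrt(cars$dist)            # transform
fit <- lm(sqrt_dist ~ speed, data = cars)    # model
slope <- coef(fit)[["speed"]]
print(slope)

# 3. Refine: look at what the model leaves unexplained and ask a new
#    question, e.g. are the largest residuals concentrated at high speeds?
cars$resid <- resid(fit)
plot(cars$speed, cars$resid, xlab = "Speed (mph)", ylab = "Residual")
abline(h = 0, lty = 2)
```

The residual plot is where the next round of questions would start.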

2 Visualizing what data

2.1 Know the data:

  1. Type

    • Quantitative (numeric, continuous, interval, ratio)
    • Qualitative (character, categorical, nominal, ordinal)
  2. Scale

    • Measurement unit
    • If qualitative, number of categories
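A quick way to check type and scale is sketched below with base R functions. The built-in `iris` data is used as a stand-in (an assumption); the same calls should work on any data frame whose qualitative variables are stored as factors.

```r
data(iris)   # built-in example data, used here as a stand-in

# Type: storage class of each variable
var_class <- sapply(iris, class)
print(var_class)

# Split variables into quantitative and qualitative
is_quant <- sapply(iris, is.numeric)

# Scale: range of each quantitative variable
# (the measurement unit itself has to come from the codebook)
quant_range <- sapply(iris[is_quant], range)
print(quant_range)

# Scale: number of categories of each qualitative variable
n_categories <- sapply(iris[!is_quant], nlevels)
print(n_categories)
```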

2.2 Chart chooser:

Chart thought starter

  1. Univariate

    • How is the variable distributed (Distribution)?
      • Examples: Histograms
    • How is the variable composed (Composition)?
      • Examples: Pie charts
  2. Groups

    • How are the groups (discrete and categorical) compared (Comparison)?
      • Small number of groups (<5)
      • Large number of groups (>=5)
  3. Bivariate or Multivariate Relationship

    • Two variables without explicit dependent variable
    • More than two variables without explicit dependent variable
    • Dependent variable explicit (Regression)
  4. Time series

  5. Matrix

Scatter plot matrix

  6. Ensemble

Ensemble plot
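The chart chooser can be read as a small decision rule. The sketch below encodes a simplified version of it; `suggest_chart()` is a hypothetical helper written for illustration, not a function from any package, and it covers only the univariate, bivariate, and time series cases.

```r
# suggest_chart() is a hypothetical helper, not part of any package.
suggest_chart <- function(x, y = NULL) {
  # Time series objects get a line plot regardless of their values
  if (inherits(x, "ts")) return("line time series plot")
  x_quant <- is.numeric(x)
  if (is.null(y)) {
    # Univariate: distribution for numbers, composition for categories
    if (x_quant) return("histogram")
    return("bar chart or pie chart")
  }
  # Bivariate: two numeric variables suggest a scatter plot;
  # anything involving a categorical variable suggests a group comparison
  if (x_quant && is.numeric(y)) return("scatter plot")
  "bar chart by group"
}

suggest_chart(iris$Sepal.Length)                     # "histogram"
suggest_chart(iris$Species)                          # "bar chart or pie chart"
suggest_chart(iris$Sepal.Length, iris$Petal.Length)  # "scatter plot"
suggest_chart(iris$Species, iris$Sepal.Length)       # "bar chart by group"
suggest_chart(AirPassengers)                         # "line time series plot"
```

A real chooser would also take the number of groups and the presence of a dependent variable into account, as the outline above does.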

2.3 Tools of Exploratory Data Analysis

  1. Univariate

    • Frequency table (descr::freq())
    • Histogram (graphics::hist())
    • Bar chart
    • Pie chart
    • Area chart
  2. Bivariate/Multivariate

    • Qualitative (Groups/Categorical)
      • Bar chart
      • Line chart
    • Quantitative (Continuous/Numeric)
      • Scatter plot
      • Bubble plot
  3. Time series

    • Trend/line time series plot
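Most of these tools have a one-line counterpart in base R. The sketch below uses the built-in `mtcars` and `AirPassengers` data as stand-ins (an assumption) and `table()` in place of `descr::freq()` so that no extra package is needed.

```r
data(mtcars)

# Univariate: frequency table and bar chart of a categorical variable
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Count")

# Univariate: histogram of a numeric variable
hist(mtcars$mpg, main = "", xlab = "Miles per gallon")

# Bivariate, quantitative: bubble plot, with circle area tracking horsepower
symbols(mtcars$wt, mtcars$mpg, circles = sqrt(mtcars$hp), inches = 0.2,
        xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Time series: line plot of a built-in monthly series
plot(AirPassengers, ylab = "Passengers (thousands)")
```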

3 Hands-on workshop: Exploratory Data Analysis with tables and charts

  1. EDA
## Gentle Machine Learning
## Exploratory Data Analysis
## Adapted from Grolemund, Garrett, and Hadley Wickham. 2018.
## R for data science. Ch.7 (https://r4ds.had.co.nz/).

# install.packages("tidyverse")
library(tidyverse)

# Plot diamonds data
attach(diamonds)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) +
  theme_bw()

# Simple table
# %>% is the forward pipe operator: it passes the result on its left
# to the function on its right
diamonds %>% 
  count(cut)
## # A tibble: 5 x 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551
# Frequency table with chart
# install.packages("descr")
library(descr)
freq(diamonds$cut)

## diamonds$cut 
##           Frequency Percent Cum Percent
## Fair           1610   2.985       2.985
## Good           4906   9.095      12.080
## Very Good     12082  22.399      34.479
## Premium       13791  25.567      60.046
## Ideal         21551  39.954     100.000
## Total         53940 100.000
library(RColorBrewer)

# What is the carat variable?
descr(diamonds$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100
# A histogram divides the x-axis into equally spaced bins and then uses 
# the height of each bar to display the number of observations in that bin.
hist(carat)

# Another look
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5) +
  theme_bw()  # Why does this look different from hist(carat)?

# Yet another look
freq(diamonds$carat)

## diamonds$carat 
##       Frequency   Percent
## 0.2          12 2.225e-02
## 0.21          9 1.669e-02
## 0.22          5 9.270e-03
## 0.23        293 5.432e-01
## 0.24        254 4.709e-01
## 0.25        212 3.930e-01
## 0.26        253 4.690e-01
## 0.27        233 4.320e-01
## 0.28        198 3.671e-01
## 0.29        130 2.410e-01
## 0.3        2604 4.828e+00
## 0.31       2249 4.169e+00
## 0.32       1840 3.411e+00
## 0.33       1189 2.204e+00
## 0.34        910 1.687e+00
## 0.35        667 1.237e+00
## 0.36        572 1.060e+00
## 0.37        394 7.304e-01
## 0.38        670 1.242e+00
## 0.39        398 7.379e-01
## 0.4        1299 2.408e+00
## 0.41       1382 2.562e+00
## 0.42        706 1.309e+00
## 0.43        488 9.047e-01
## 0.44        212 3.930e-01
## 0.45        110 2.039e-01
## 0.46        178 3.300e-01
## 0.47         99 1.835e-01
## 0.48         63 1.168e-01
## 0.49         45 8.343e-02
## 0.5        1258 2.332e+00
## 0.51       1127 2.089e+00
## 0.52        817 1.515e+00
## 0.53        709 1.314e+00
## 0.54        625 1.159e+00
## 0.55        496 9.195e-01
## 0.56        492 9.121e-01
## 0.57        430 7.972e-01
## 0.58        310 5.747e-01
## 0.59        282 5.228e-01
## 0.6         228 4.227e-01
## 0.61        204 3.782e-01
## 0.62        135 2.503e-01
## 0.63        102 1.891e-01
## 0.64         80 1.483e-01
## 0.65         65 1.205e-01
## 0.66         48 8.899e-02
## 0.67         48 8.899e-02
## 0.68         25 4.635e-02
## 0.69         26 4.820e-02
## 0.7        1981 3.673e+00
## 0.71       1294 2.399e+00
## 0.72        764 1.416e+00
## 0.73        492 9.121e-01
## 0.74        322 5.970e-01
## 0.75        249 4.616e-01
## 0.76        251 4.653e-01
## 0.77        251 4.653e-01
## 0.78        187 3.467e-01
## 0.79        155 2.874e-01
## 0.8         284 5.265e-01
## 0.81        200 3.708e-01
## 0.82        140 2.595e-01
## 0.83        131 2.429e-01
## 0.84         64 1.187e-01
## 0.85         62 1.149e-01
## 0.86         34 6.303e-02
## 0.87         31 5.747e-02
## 0.88         23 4.264e-02
## 0.89         21 3.893e-02
## 0.9        1485 2.753e+00
## 0.91        570 1.057e+00
## 0.92        226 4.190e-01
## 0.93        142 2.633e-01
## 0.94         59 1.094e-01
## 0.95         65 1.205e-01
## 0.96        103 1.910e-01
## 0.97         59 1.094e-01
## 0.98         31 5.747e-02
## 0.99         23 4.264e-02
## 1          1558 2.888e+00
## 1.01       2242 4.156e+00
## 1.02        883 1.637e+00
## 1.03        523 9.696e-01
## 1.04        475 8.806e-01
## 1.05        361 6.693e-01
## 1.06        373 6.915e-01
## 1.07        342 6.340e-01
## 1.08        246 4.561e-01
## 1.09        287 5.321e-01
## 1.1         278 5.154e-01
## 1.11        308 5.710e-01
## 1.12        251 4.653e-01
## 1.13        246 4.561e-01
## 1.14        207 3.838e-01
## 1.15        149 2.762e-01
## 1.16        172 3.189e-01
## 1.17        110 2.039e-01
## 1.18        123 2.280e-01
## 1.19        126 2.336e-01
## 1.2         645 1.196e+00
## 1.21        473 8.769e-01
## 1.22        300 5.562e-01
## 1.23        279 5.172e-01
## 1.24        236 4.375e-01
## 1.25        187 3.467e-01
## 1.26        146 2.707e-01
## 1.27        134 2.484e-01
## 1.28        106 1.965e-01
## 1.29        101 1.872e-01
## 1.3         122 2.262e-01
## 1.31        133 2.466e-01
## 1.32         89 1.650e-01
## 1.33         87 1.613e-01
## 1.34         68 1.261e-01
## 1.35         77 1.428e-01
## 1.36         50 9.270e-02
## 1.37         46 8.528e-02
## 1.38         26 4.820e-02
## 1.39         36 6.674e-02
## 1.4          50 9.270e-02
## 1.41         40 7.416e-02
## 1.42         25 4.635e-02
## 1.43         19 3.522e-02
## 1.44         18 3.337e-02
## 1.45         15 2.781e-02
## 1.46         18 3.337e-02
## 1.47         21 3.893e-02
## 1.48          7 1.298e-02
## 1.49         11 2.039e-02
## 1.5         793 1.470e+00
## 1.51        807 1.496e+00
## 1.52        381 7.063e-01
## 1.53        220 4.079e-01
## 1.54        174 3.226e-01
## 1.55        124 2.299e-01
## 1.56        109 2.021e-01
## 1.57        106 1.965e-01
## 1.58         89 1.650e-01
## 1.59         89 1.650e-01
## 1.6          95 1.761e-01
## 1.61         64 1.187e-01
## 1.62         61 1.131e-01
## 1.63         50 9.270e-02
## 1.64         43 7.972e-02
## 1.65         32 5.933e-02
## 1.66         30 5.562e-02
## 1.67         25 4.635e-02
## 1.68         19 3.522e-02
## 1.69         24 4.449e-02
## 1.7         215 3.986e-01
## 1.71        119 2.206e-01
## 1.72         57 1.057e-01
## 1.73         52 9.640e-02
## 1.74         40 7.416e-02
## 1.75         50 9.270e-02
## 1.76         28 5.191e-02
## 1.77         17 3.152e-02
## 1.78         12 2.225e-02
## 1.79         15 2.781e-02
## 1.8          21 3.893e-02
## 1.81          9 1.669e-02
## 1.82         13 2.410e-02
## 1.83         18 3.337e-02
## 1.84          4 7.416e-03
## 1.85          3 5.562e-03
## 1.86          9 1.669e-02
## 1.87          7 1.298e-02
## 1.88          4 7.416e-03
## 1.89          4 7.416e-03
## 1.9           7 1.298e-02
## 1.91         12 2.225e-02
## 1.92          2 3.708e-03
## 1.93          6 1.112e-02
## 1.94          3 5.562e-03
## 1.95          3 5.562e-03
## 1.96          4 7.416e-03
## 1.97          4 7.416e-03
## 1.98          5 9.270e-03
## 1.99          3 5.562e-03
## 2           265 4.913e-01
## 2.01        440 8.157e-01
## 2.02        177 3.281e-01
## 2.03        122 2.262e-01
## 2.04         86 1.594e-01
## 2.05         67 1.242e-01
## 2.06         60 1.112e-01
## 2.07         50 9.270e-02
## 2.08         41 7.601e-02
## 2.09         45 8.343e-02
## 2.1          52 9.640e-02
## 2.11         43 7.972e-02
## 2.12         25 4.635e-02
## 2.13         21 3.893e-02
## 2.14         48 8.899e-02
## 2.15         22 4.079e-02
## 2.16         25 4.635e-02
## 2.17         18 3.337e-02
## 2.18         31 5.747e-02
## 2.19         22 4.079e-02
## 2.2          32 5.933e-02
## 2.21         23 4.264e-02
## 2.22         27 5.006e-02
## 2.23         13 2.410e-02
## 2.24         16 2.966e-02
## 2.25         18 3.337e-02
## 2.26         15 2.781e-02
## 2.27         12 2.225e-02
## 2.28         20 3.708e-02
## 2.29         17 3.152e-02
## 2.3          21 3.893e-02
## 2.31         13 2.410e-02
## 2.32         16 2.966e-02
## 2.33          9 1.669e-02
## 2.34          5 9.270e-03
## 2.35          7 1.298e-02
## 2.36          8 1.483e-02
## 2.37          6 1.112e-02
## 2.38          8 1.483e-02
## 2.39          7 1.298e-02
## 2.4          13 2.410e-02
## 2.41          5 9.270e-03
## 2.42          8 1.483e-02
## 2.43          6 1.112e-02
## 2.44          4 7.416e-03
## 2.45          4 7.416e-03
## 2.46          3 5.562e-03
## 2.47          3 5.562e-03
## 2.48          9 1.669e-02
## 2.49          3 5.562e-03
## 2.5          17 3.152e-02
## 2.51         17 3.152e-02
## 2.52          9 1.669e-02
## 2.53          8 1.483e-02
## 2.54          9 1.669e-02
## 2.55          3 5.562e-03
## 2.56          3 5.562e-03
## 2.57          3 5.562e-03
## 2.58          3 5.562e-03
## 2.59          1 1.854e-03
## 2.6           3 5.562e-03
## 2.61          3 5.562e-03
## 2.63          3 5.562e-03
## 2.64          1 1.854e-03
## 2.65          1 1.854e-03
## 2.66          3 5.562e-03
## 2.67          1 1.854e-03
## 2.68          2 3.708e-03
## 2.7           1 1.854e-03
## 2.71          1 1.854e-03
## 2.72          3 5.562e-03
## 2.74          3 5.562e-03
## 2.75          2 3.708e-03
## 2.77          1 1.854e-03
## 2.8           2 3.708e-03
## 3             8 1.483e-02
## 3.01         14 2.595e-02
## 3.02          1 1.854e-03
## 3.04          2 3.708e-03
## 3.05          1 1.854e-03
## 3.11          1 1.854e-03
## 3.22          1 1.854e-03
## 3.24          1 1.854e-03
## 3.4           1 1.854e-03
## 3.5           1 1.854e-03
## 3.51          1 1.854e-03
## 3.65          1 1.854e-03
## 3.67          1 1.854e-03
## 4             1 1.854e-03
## 4.01          2 3.708e-03
## 4.13          1 1.854e-03
## 4.5           1 1.854e-03
## 5.01          1 1.854e-03
## Total     53940 1.000e+02
# Can you build a histogram for cut?  Why not?

# Look closer at smaller carat diamonds (left portion from previous histogram)

smaller <- diamonds %>% 
  filter(carat < 3)

# Set small binwidth
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1) + theme_bw()

# Polygon
ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1) + theme_bw() +
  scale_color_brewer(palette = "Spectral") 
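
The two questions raised in the comments above (why `hist(carat)` and `geom_histogram(binwidth = 0.5)` look different, and why `cut` cannot be drawn as a histogram) can be checked directly. The sketch below assumes ggplot2 is installed for the `diamonds` data and uses `hist(..., plot = FALSE)` to compare bins without drawing. The break points every 0.5 carat only approximate what ggplot2 does, since it may place the bin edges differently.

```r
library(ggplot2)   # provides the diamonds data

# hist() picks its own breaks (Sturges' rule by default) ...
h_default <- hist(diamonds$carat, plot = FALSE)

# ... while a bin width of 0.5 means far fewer, wider bins
wide_breaks <- seq(0, 5.5, by = 0.5)
h_wide <- hist(diamonds$carat, breaks = wide_breaks, plot = FALSE)

# Same observations, different number of bins
length(h_default$counts)
length(h_wide$counts)
sum(h_default$counts) == sum(h_wide$counts)

# cut is an ordered factor, so it has no numeric axis to divide into bins
is.numeric(diamonds$cut)

# The categorical counterpart is a bar chart of counts per category
barplot(table(diamonds$cut))
```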

  2. Scatterplot Matrix
## Gentle Machine Learning
## Scatter plot matrix
## Extracted from Alexander C. Tan, Karl Ho & Cal Clark. 2020. The political 
## economy of Taiwan’s regional relations, Asian Affairs: An American Review


# Check packages
doInstall <- TRUE  # Set to FALSE to skip installation if the packages are already installed
toInstall <- c("openxlsx", "tidyverse", "RColorBrewer", "GGally")
if(doInstall){install.packages(toInstall, repos = "http://cran.us.r-project.org")}
## 
## The downloaded binary packages are in
##  /var/folders/qp/s6y46pq11y13t0gpnf4_v9vm0000gp/T//Rtmpazzsxp/downloaded_packages
lapply(toInstall, require, character.only = TRUE) # call into library
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
# Import data from GitHub
imfgrowth = openxlsx::read.xlsx("https://github.com/datageneration/gentlemachinelearning/raw/master/data/imfgrowth.xlsx")

attach(imfgrowth)

# imfgrowth = rename(imfgrowth, US = "United.States")
imf8019 = imfgrowth[which(imfgrowth$Year<2020),]
imf8019$decade = as.factor(imf8019$decade) # Change decade into factor
attach(imf8019)

# Create group for comparison
# NSP is the group of countries targeted by Taiwan's New Southbound Policy (2016)
tcuan = data.frame(China, Taiwan, United.States, NSP, ASEAN)

# Pairwise scatterplot matrix
# Specifying font, subject to font availability on the local computer
ggpairs(tcuan) + theme_bw() +  
  theme(text = element_text(size=12,  family = "Palatino"))

## Bivariate scatterplots with regression line
ggduo(
  tcuan,
  types = list(continuous = "smooth_lm")) + theme_bw()

## Scatter plot matrix
## Choose variables to be plotted
ggscatmat(imf8019, columns = 20:24,  alpha = 0.8) + 
  theme_bw() +  
  theme(text = element_text(size=12,  family = "Palatino")) + 
  labs(y = "Economic growth, 1980-2018",x = "Economic growth, 1980-2018") +
  scale_fill_brewer(palette="Set1") + scale_color_brewer(palette="Set1") 
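
When GGally is not available, a comparable scatter plot matrix can be sketched with base R's `pairs()`. The built-in `swiss` data and the `panel_cor()` helper below are illustrative assumptions, not part of the original example. The helper prints the Pearson correlation in each upper panel, loosely following the layout that `ggpairs()` produces.

```r
data(swiss)   # built-in data, used here as a stand-in for the IMF growth data
vars <- swiss[, 1:4]

# panel_cor() is a hypothetical helper: it writes the correlation
# of the two variables in the middle of the panel
panel_cor <- function(x, y, ...) {
  op <- par(usr = c(0, 1, 0, 1))   # temporarily use a 0-1 coordinate system
  on.exit(par(op))                 # restore the panel's own coordinates
  text(0.5, 0.5, sprintf("%.2f", cor(x, y)))
}

# Scatter plots with a smoother below the diagonal, correlations above it
pairs(vars, lower.panel = panel.smooth, upper.panel = panel_cor)

# The same correlations as a table
cm <- cor(vars)
print(round(cm, 2))
```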

  3. Ensemble
## Gentle Machine Learning
## Ensemble plot
## Adapted from example in Unwin, Antony. 2015. Graphical data analysis with R. Vol. 27. CRC Press.

#doInstall <- TRUE  # For checking if package is installed
#toInstall <- c("pgmm", "tidyverse", "pdp", "GGally", "grid", "gridExtra")
#if(doInstall){install.packages(toInstall, repos = "http://cran.us.r-project.org")}
# lapply(toInstall, library, character.only = TRUE) # call into library


library(pgmm)
library(tidyverse)
library(pdp)
library(GGally)
library(grid)
library(gridExtra)

# Load data
# Data on the chemical composition of coffee samples collected from around the 
# world, comprising 43 samples from 29 countries. Each sample is either of the 
# Arabica or Robusta variety. Twelve of the thirteen chemical constituents 
# reported in the study are given. 
# The omitted variable is total chlorogenic acid; it is generally the sum of 
# the chlorogenic, neochlorogenic and isochlorogenic acid values.

data(coffee, package="pgmm")
coffee <- within(coffee, Type <- ifelse(Variety==1,
                                        "Arabica", "Robusta"))
names(coffee) <- abbreviate(names(coffee), 8)
a <- ggplot(coffee, aes(x=Type)) + geom_bar(aes(fill=Type)) +
  scale_fill_manual(values = c("grey70", "red")) +
  guides(fill = "none") + ylab("") +
  theme_bw() +
  theme(text = element_text(family="Palatino")) 
 
b <- ggplot(coffee, aes(x=Fat, y=Caffine, colour=Type)) +
  geom_point(size=2) +
  scale_colour_manual(values = c("grey70", "red")) +
  theme_bw() +
  theme(text = element_text(family="Palatino"))
 
c <- ggparcoord(coffee[order(coffee$Type),], columns=3:14,
                groupColumn="Type", scale="uniminmax",
                mapping = aes(size = 1), splineFactor = TRUE ) +
  xlab("") +  ylab("") +
  scale_colour_manual(values = c("grey","red")) +
  theme_bw() +
  # Apply after theme_bw(), which would otherwise reset the legend position
  theme(legend.position = "none",
        text = element_text(family="Palatino")) 

# Combine into one page using grid
grid.arrange(arrangeGrob(a, b, ncol=2, widths=c(1,2)),
             c, nrow=2) 
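
The `scale="uniminmax"` option used in the parallel coordinate plot rescales each variable separately so that its minimum becomes 0 and its maximum becomes 1. The sketch below reproduces that idea in base R. `rescale_minmax()` is a hypothetical helper and the built-in `mtcars` columns are an arbitrary choice, so this illustrates the scaling and is not GGally's implementation.

```r
# rescale_minmax() is a hypothetical helper: it maps the minimum of x to 0
# and the maximum to 1 (a constant column would divide by zero)
rescale_minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled <- as.data.frame(lapply(mtcars[, c("mpg", "hp", "wt")], rescale_minmax))

# Base R parallel coordinate plot: one line per row, one axis per variable
matplot(t(as.matrix(scaled)), type = "l", lty = 1, col = "grey40",
        xaxt = "n", xlab = "", ylab = "Rescaled value")
axis(1, at = seq_along(scaled), labels = names(scaled))
```

Putting every variable on the same 0 to 1 range keeps variables with large units from dominating the plot.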

References

Grolemund, Garrett, and Hadley Wickham. 2018. R for data science. (https://r4ds.had.co.nz/).

Unwin, Antony. 2015. Graphical data analysis with R. Boca Raton, FL: CRC Press.